Skip to content

Conversation

@simonrosenberg
Copy link
Collaborator

@simonrosenberg simonrosenberg commented Nov 14, 2025


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.12-nodejs22 Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:898eb1c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-898eb1c-python \
  ghcr.io/openhands/agent-server:898eb1c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:898eb1c-golang-amd64
ghcr.io/openhands/agent-server:898eb1c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:898eb1c-golang-arm64
ghcr.io/openhands/agent-server:898eb1c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:898eb1c-java-amd64
ghcr.io/openhands/agent-server:898eb1c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:898eb1c-java-arm64
ghcr.io/openhands/agent-server:898eb1c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:898eb1c-python-amd64
ghcr.io/openhands/agent-server:898eb1c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:898eb1c-python-arm64
ghcr.io/openhands/agent-server:898eb1c-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:898eb1c-golang
ghcr.io/openhands/agent-server:898eb1c-java
ghcr.io/openhands/agent-server:898eb1c-python

About Multi-Architecture Support

  • Each variant tag (e.g., 898eb1c-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 898eb1c-python-amd64) are also available if needed

@simonrosenberg simonrosenberg self-assigned this Nov 14, 2025
@simonrosenberg simonrosenberg marked this pull request as ready for review November 17, 2025 10:48
@simonrosenberg simonrosenberg marked this pull request as draft November 17, 2025 13:09
@simonrosenberg simonrosenberg removed the request for review from xingyaoww November 17, 2025 13:09
@simonrosenberg simonrosenberg marked this pull request as ready for review November 17, 2025 13:31
@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Nov 23, 2025

[Automatic Post]: It has been a while since there was any activity on this PR. @simonrosenberg, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@simonrosenberg
Copy link
Collaborator Author

@OpenHands /codereview-roasted
Clone the repos https://github.com/OpenHands/evaluation https://github.com/OpenHands/benchmarks/issues
and read the diff of this PR to understnad how it triggers the eval workflow for software-agent-sdk. The review the code in software-agent-sdk. Suggest any simplification and report any bug

@openhands-ai
Copy link

openhands-ai bot commented Nov 24, 2025

I'm on it! simonrosenberg can track my progress at all-hands.dev

@openhands-ai
Copy link

openhands-ai bot commented Nov 24, 2025

Code Review Summary: PR #1167 - Eval Workflow Integration

I've completed a comprehensive "Linus-style" code review of the PR that adds evaluation job triggering to the software-agent-sdk CI pipeline. Here's what I found:

How the Workflow Works

The PR adds a GitHub Actions workflow (.github/workflows/run-eval.yml) that:

  1. Triggers via PR labels (run-eval-1/2/50/100), releases, or manual dispatch
  2. Validates the triggering user against an allowlist and model stubs against another allowlist
  3. Builds SWE-Bench Docker images using the OpenHands/benchmarks repo
  4. Dispatches the actual evaluation to OpenHands/evaluation repo's workflow
  5. Comments on the PR with evaluation details

🔴 CRITICAL ISSUES (Must Fix)

1. Security Risk: Using eval command (Lines 204-216)

  • The workflow constructs a shell command as a string then uses eval to execute it
  • This is dangerous and completely unnecessary since EVAL_LIMIT is always set
  • Fix: Remove the conditional flag logic and call the Python script directly with --n-limit "${EVAL_LIMIT}"

2. Bug: Incorrect SHA substring logic (Line 256)

  • SDK_SHA_SHORT=${SDK_SHA:0:40} extracts 40 chars, but SHA-1 is already 40 chars, so this does nothing
  • GitHub's API accepts full SHAs anyway, making this pointless
  • Fix: Remove the substring operation and use $SDK_SHA directly

3. Missing Error Handling: Dispatch call (Lines 236-240)

  • The curl command to dispatch the evaluation workflow has no error checking
  • If it fails, the workflow continues and reports success
  • Fix: Check the HTTP response code (should be 204) and fail if the dispatch fails

4. Data Flow Issue: Double checkout (Lines 68-71 + 151-157)

  • First checks out the base branch to validate allowlists
  • Then checks out the PR head to build images
  • This creates confusion about which ref is being evaluated
  • Improvement: Clarify this two-step checkout with comments or restructure

⚠️ IMPROVEMENT OPPORTUNITIES

  1. Complex parameter resolution (Lines 94-149): Single bash script handles 3 trigger types with nested conditionals - works but could be cleaner with better comments
  2. Model validation complexity (Lines 132-143): Mixes bash string processing with jq - should pick one approach
  3. Release-to-PR lookup may fail (Lines 254-262): Assumes releases always map to PRs, which isn't guaranteed
  4. Future-dated model names: The allowlist includes models dated 2025, which may become stale

✅ WHAT'S GOOD

  • Using pull_request_target with allowlist validation is the correct security approach
  • The overall architecture (build images, then dispatch eval) makes sense
  • Proper use of set -euo pipefail in most bash blocks
  • Clear separation of allowlist files

VERDICT: ❌ Needs rework before merging

The eval security issue and missing error handling on the dispatch call must be fixed. The SHA substring bug should also be addressed as it serves no purpose and demonstrates a misunderstanding of how Git SHAs work.

No code changes were made per the code review instructions - this is feedback only. The PR author should address these issues before merging.

View full conversation

@openhands-ai
Copy link

openhands-ai bot commented Nov 24, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • .github/workflows/run-eval.yml

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1167 at branch `add-eval-job-run-on-CI`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

@simonrosenberg simonrosenberg merged commit 4f0cc50 into main Nov 24, 2025
21 checks passed
@simonrosenberg simonrosenberg deleted the add-eval-job-run-on-CI branch November 24, 2025 10:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants